{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "6.3. 语言模型数据集（周杰伦专辑歌词）.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/chongzicbo/Dive-into-Deep-Learning-tf.keras/blob/master/6.3.%20%E8%AF%AD%E8%A8%80%E6%A8%A1%E5%9E%8B%E6%95%B0%E6%8D%AE%E9%9B%86%EF%BC%88%E5%91%A8%E6%9D%B0%E4%BC%A6%E4%B8%93%E8%BE%91%E6%AD%8C%E8%AF%8D%EF%BC%89.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VunRkS8UlfH3",
        "colab_type": "text"
      },
      "source": [
        "##6.3. 语言模型数据集（周杰伦专辑歌词）\n",
        "本节将介绍如何预处理一个语言模型数据集，并将其转换成字符级循环神经网络所需要的输入格式。为此，我们收集了周杰伦从第一张专辑《Jay》到第十张专辑《跨时代》中的歌词，并在后面几节里应用循环神经网络来训练一个语言模型。当模型训练好后，我们就可以用这个模型来创作歌词。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "MnEMZYrHeVUq",
        "colab_type": "code",
        "outputId": "fd7e7da1-8043-4d8a-c60e-d80c4c753437",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 64
        }
      },
      "source": [
        "%matplotlib inline\n",
        "import math\n",
        "import tensorflow as tf\n",
        "import numpy as np\n",
        "from IPython import display\n",
        "import matplotlib.pyplot as plt\n",
        "from tensorflow import keras\n",
        "from tensorflow.keras import losses\n",
        "from tensorflow.data import Dataset\n",
        "import time\n",
        "import random\n",
        "import zipfile"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<p style=\"color: red;\">\n",
              "The default version of TensorFlow in Colab will soon switch to TensorFlow 2.x.<br>\n",
              "We recommend you <a href=\"https://www.tensorflow.org/guide/migrate\" target=\"_blank\">upgrade</a> now \n",
              "or ensure your notebook will continue to use TensorFlow 1.x via the <code>%tensorflow_version 1.x</code> magic:\n",
              "<a href=\"https://colab.research.google.com/notebooks/tensorflow_version.ipynb\" target=\"_blank\">more info</a>.</p>\n"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zy9C1NYpebUS",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "tf.enable_eager_execution()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "AYHiSL23d8zT",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from google.colab import drive\n",
        "drive.mount('/content/drive')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WiBj1-M2qRxV",
        "colab_type": "text"
      },
      "source": [
        "6.3.1. 读取数据集和建立字符索引\n",
        "首先读取这个数据集，我们把换行符替换成空格，然后仅使用前1万个字符来训练模型。然后将每个字符映射成一个从0开始的连续整数，又称索引，来方便之后的数据处理。为了得到索引，我们将数据集里所有不同字符取出来，然后将其逐一映射到索引来构造词典。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "f4qXFtnVeHor",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def load_data_jay_lyrics():\n",
        "  from google.colab import drive\n",
        "  drive.mount('/content/drive')\n",
        "  with zipfile.ZipFile('/content/drive/My Drive/data/d2l-zh-tensoflow/jaychou_lyrics.txt.zip')as zin:\n",
        "    with zin.open('jaychou_lyrics.txt') as f:\n",
        "      corpus_chars=f.read().decode('utf-8')\n",
        "  corpus_chars=corpus_chars.replace('\\n',' ').replace('\\r',' ')\n",
        "  corpus_chars=corpus_chars[0:10000]\n",
        "  idx_to_char=list(set(corpus_chars))\n",
        "  char_to_idx=dict([(char,i) for i,char in enumerate(idx_to_char)])\n",
        "  vocab_size=len(char_to_idx)\n",
        "  corpus_indices=[char_to_idx[char] for char in corpus_chars]\n",
        "  return corpus_indices,char_to_idx,idx_to_char,vocab_size"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rEXZlPgre8vn",
        "colab_type": "code",
        "outputId": "f60f4cb2-fa48-4783-b87c-bd09e8b66038",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 53
        }
      },
      "source": [
        "corpus_indices,char_to_idx,idx_to_char,vocab_size=load_data_jay_lyrics()\n",
        "corpus_chars[:40]"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'想要有直升机 想要和你飞到宇宙去 想要和你融化在一起 融化在宇宙里 我每天每天每'"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 12
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g3lcfW8PfRh7",
        "colab_type": "text"
      },
      "source": [
        "###6.3.2. 时序数据的采样\n",
        "在训练中我们需要每次随机读取小批量样本和标签。与之前章节的实验数据不同的是，时序数据的一个样本通常包含连续的字符。假设时间步数为5，样本序列为5个字符，即“想”“要”“有”“直”“升”。该样本的标签序列为这些字符分别在训练集中的下一个字符，即“要”“有”“直”“升”“机”。我们有两种方式对时序数据进行采样，分别是随机采样和相邻采样。\n",
        "\n",
        "####6.3.2.1. 随机采样\n",
        "下面的代码每次从数据里随机采样一个小批量。其中批量大小batch_size指每个小批量的样本数，num_steps为每个样本所包含的时间步数。 在随机采样中，每个样本是原始序列上任意截取的一段序列。相邻的两个随机小批量在原始序列上的位置不一定相毗邻。因此，我们无法用一个小批量最终时间步的隐藏状态来初始化下一个小批量的隐藏状态。在训练模型时，每次随机采样前都需要重新初始化隐藏状态。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "OLpXz24-gUTj",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def data_iter_random(corpus_indices,batch_size,num_steps):\n",
        "  #减1是因为输出的索引是相应输入的索引加1\n",
        "  num_examples=(len(corpus_indices)-1)//num_steps\n",
        "  epoch_size=num_examples//batch_size\n",
        "  example_indices=list(range(num_examples))\n",
        "  random.shuffle(example_indices)\n",
        "\n",
        "  #返回从pos开始的长为num_steps的序列\n",
        "  def _data(pos):\n",
        "    return corpus_indices[pos:pos+num_steps]\n",
        "\n",
        "  for i in range(epoch_size):\n",
        "    #每次读取batch_size个随机样本\n",
        "    i=i*batch_size\n",
        "    batch_indices=example_indices[i:i+batch_size]\n",
        "    X=[_data(j*num_steps) for j in batch_indices]\n",
        "    Y=[_data(j*num_steps+1) for j in batch_indices]\n",
        "    yield tf.constant(X),tf.constant(Y)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U4KnDK12rLnP",
        "colab_type": "text"
      },
      "source": [
        "让我们输入一个从0到29的连续整数的人工序列。设批量大小和时间步数分别为2和6。打印随机采样每次读取的小批量样本的输入X和标签Y。可见，相邻的两个随机小批量在原始序列上的位置不一定相毗邻。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "moMxkshPji56",
        "colab_type": "code",
        "outputId": "f3ed751c-ecbb-4cae-ddb9-8c75d4ca90c9",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 269
        }
      },
      "source": [
        "my_seq=list(range(30))\n",
        "for X,Y in data_iter_random(my_seq,batch_size=2,num_steps=6):\n",
        "  print('X:',X,'\\nY:',Y,'\\n')"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "X: tf.Tensor(\n",
            "[[18 19 20 21 22 23]\n",
            " [12 13 14 15 16 17]], shape=(2, 6), dtype=int32) \n",
            "Y: tf.Tensor(\n",
            "[[19 20 21 22 23 24]\n",
            " [13 14 15 16 17 18]], shape=(2, 6), dtype=int32) \n",
            "\n",
            "X: tf.Tensor(\n",
            "[[ 6  7  8  9 10 11]\n",
            " [ 0  1  2  3  4  5]], shape=(2, 6), dtype=int32) \n",
            "Y: tf.Tensor(\n",
            "[[ 7  8  9 10 11 12]\n",
            " [ 1  2  3  4  5  6]], shape=(2, 6), dtype=int32) \n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rBB-zvu_rQcj",
        "colab_type": "text"
      },
      "source": [
        "####6.3.2.2. 相邻采样\n",
        "除对原始序列做随机采样之外，我们还可以令相邻的两个随机小批量在原始序列上的位置相毗邻。这时候，我们就可以用一个小批量最终时间步的隐藏状态来初始化下一个小批量的隐藏状态，从而使下一个小批量的输出也取决于当前小批量的输入，并如此循环下去。这对实现循环神经网络造成了两方面影响：一方面， 在训练模型时，我们只需在每一个迭代周期开始时初始化隐藏状态；另一方面，当多个相邻小批量通过传递隐藏状态串联起来时，模型参数的梯度计算将依赖所有串联起来的小批量序列。同一迭代周期中，随着迭代次数的增加，梯度的计算开销会越来越大。 为了使模型参数的梯度计算只依赖一次迭代读取的小批量序列，我们可以在每次读取小批量前将隐藏状态从计算图中分离出来。我们将在“循环神经网络的从零开始实现”一节的实现中了解这种处理方式。\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "iFQI3k1GkFNB",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def data_iter_consecutive(corpus_indices,batch_size,num_steps):\n",
        "  corpus_indices=tf.constant(corpus_indices)\n",
        "  data_len=len(corpus_indices)\n",
        "  batch_len=data_len//batch_size\n",
        "  indices=tf.reshape(corpus_indices[0:batch_size*batch_len],shape=(batch_size,batch_len))\n",
        "  epoch_size=(batch_len-1) // num_steps\n",
        "  for i in range(epoch_size):\n",
        "    i=i*num_steps\n",
        "    X=indices[:,i:i+num_steps]\n",
        "    Y=indices[:,i+1:i+num_steps+1]\n",
        "    yield X,Y"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "l4AB-EerrXZM",
        "colab_type": "text"
      },
      "source": [
        "同样的设置下，打印相邻采样每次读取的小批量样本的输入X和标签Y。相邻的两个随机小批量在原始序列上的位置相毗邻。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "y-71EAnrmIb3",
        "colab_type": "code",
        "outputId": "e5396220-1634-4848-d7df-eca31cb58c45",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 269
        }
      },
      "source": [
        "for X,Y in data_iter_consecutive(my_seq,batch_size=2,num_steps=6):\n",
        "  print('X:',X,'\\nY:',Y,'\\n')"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "X: tf.Tensor(\n",
            "[[ 0  1  2  3  4  5]\n",
            " [15 16 17 18 19 20]], shape=(2, 6), dtype=int32) \n",
            "Y: tf.Tensor(\n",
            "[[ 1  2  3  4  5  6]\n",
            " [16 17 18 19 20 21]], shape=(2, 6), dtype=int32) \n",
            "\n",
            "X: tf.Tensor(\n",
            "[[ 6  7  8  9 10 11]\n",
            " [21 22 23 24 25 26]], shape=(2, 6), dtype=int32) \n",
            "Y: tf.Tensor(\n",
            "[[ 7  8  9 10 11 12]\n",
            " [22 23 24 25 26 27]], shape=(2, 6), dtype=int32) \n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dQ4tnyCpmOxJ",
        "colab_type": "text"
      },
      "source": [
        "###6.3.3. 小结\n",
        "时序数据采样方式包括随机采样和相邻采样。使用这两种方式的循环神经网络训练在实现上略有不同。"
      ]
    }
  ]
}